The grammar of graphics is a framework for constructing statistical graphics in a principled way.
The grammr defines an explicit relationship between the variables in your data you have and the graphic or plot you wish to represent. It was first defined by Lee Wilkinson and extended by Hadley Wickham in the R package ggplot2.
Throughout this section, you uncover how the charts you know by name can be created via ggplot2, learn how to construct a ggplot2 graphic and then build a plot up, layer-by-layer.
The grammar of graphics allows you to define the mapping between variables in the data, with elements of the plot. It allows us to see and understand how plots are similar or different.
The grammar also helps you see how variations in the definition create variations in the plot. Using named plots, for example a pie chart, bar chart, scatterplot, in some ways is like seeing animals in the zoo.
To showcase ggplot2 you will build graphics using a data set provided by the World Health Organisation (WHO). This is current tuberculosis (TB) case notifications aggregated by country, gender and age group. We have tidied up this data for you and saved it in the R data storage format.
Before making your way through this step, we strongly recommend that you download the data set and then place it in your project folder on your computer.
The data consists of six columns:
On your computer, open RStudio and then load the TB data in with R using the tidyverse:
library(tidyverse)
tb <- read_rds("data/tb_tidy.rds")
tb
## # A tibble: 47,866 x 6
## country iso3 year count gender age_group
## <chr> <chr> <dbl> <dbl> <fct> <fct>
## 1 Afghanistan AFG 1997 10 M 15-24
## 2 Afghanistan AFG 1998 129 M 15-24
## 3 Afghanistan AFG 1999 55 M 15-24
## 4 Afghanistan AFG 2000 228 M 15-24
## 5 Afghanistan AFG 2001 379 M 15-24
## 6 Afghanistan AFG 2002 476 M 15-24
## 7 Afghanistan AFG 2003 511 M 15-24
## 8 Afghanistan AFG 2004 537 M 15-24
## 9 Afghanistan AFG 2005 606 M 15-24
## 10 Afghanistan AFG 2006 837 M 15-24
## # … with 47,856 more rows
For the examples, throughout Week 2 you will use a subset of this data corresponding to TB cases in Australia
tb_au <- filter(tb, country == "Australia")
tb_au
That’s it for now, but in the next section of the course you will create charts from the TB data to find out if there are differences between age groups, and gender between TB cases in Australia.
The end result is a 100% chart, that compares the proportion of TB cases in each age group by gender. Your finished chart should resemble the following:
There are three key learning from this chart:
The first code that’s used when making any ggplot2 graphic is the function ggplot(). Usually, you pass a data set and an aesthetic mapping as arguments to the function.
An aesthetic mapping which we also sometimes call a variable mapping is constructed with the function aes(). With aes() we are telling ggplot() which variable in our data should correspond to a particular graphical element to be drawn.
Consider the following example:
p <- ggplot(tb_au, aes(x = year, y = count, fill = gender))
The first argument provided is the name of the data, tb_au. And the variable mapping consists of three elements, the variable:
The result is saved to an object called p. What does it look like if you print it?
p
You have a blank plot! The year and count have been mapped to the axes, although there’s no fill.
In order to draw something we need a geom, this refers to the geometrical shape (think a point, or a rectangle) that will be drawn.
Draw bars, so you can add geom_bar() to your plot object p.
p <- p + geom_bar(stat = "identity", position = "fill")
When new grammar elements are added to a plot object using ggplot2, they are included using the + operator.
Here, you have added geom_bar() with two arguments stat = "identity" and position = "fill". The former refers to any transformation that will be applied to the count column.
You have already counted how many TB incidences are in each combination of categories, so stat = "identity" says no need to compute the count setting. The latter refers to how the bars are placed. We are mostly interested in proportions between gender, over years, separately by age. The position = "fill" option in geom_bar sets the heights of the bars to be all at 100%. It ignores counts, and emphasizes the proportion of males and females.
Let’s show the result:
p
Now we have bars, that are filled by the proportion of male and female TB cases for each year. You might have noticed that ggplot automatically draws a legend. You may want might want to consider changing the colour of the bars to accommodate colour blindness. You can do this by adding a scale element.
p <- p + scale_fill_brewer(palette = "Dark2")
p
For this exercise, modify the fill colours by using the colour palettes from Cynthia Brewer. The argument palette = "Dark2" refers to the name of the colour palette. You can see all of the colour brewer palettes on colorbrewer.org.
The final step is create a bar chart for each age group present in the TB notifactions, which you can do by adding a facet element.
p <- p + facet_grid(. ~ age_group)
p
A facet creates subplots for each category (or combination of categories) laid out in a grid. The argument that constructs the facets is . ~ age_group.
This is called a formula, and is of the form left hand side ~ right hand side. The ~ is called a tilde, and is usually located on on the top left of most keyboards. Depending on how you place variables with a formula, it changes how the subplots are laid out on the screen.
p <- p + facet_grid( age_group ~ .)
p
To summarise, you have created your first ggplot one layer at a time. Once you become more confident you will be able to start building everything up at once. Your completed chart should resemble the following:
p <- ggplot(tb_au, aes(x = year, y = count, fill = gender)) +
geom_bar(stat = "identity", position = "fill") +
facet_grid(~ age_group) +
scale_fill_brewer(palette="Dark2")
p
Bar charts are best used to show ‘counts’ rather than ‘proportions’.
Consider the following sample chart, which shows incidents of tuberculosis (TB).
The focus of the chart is on counts in each category. It shows that counts are different across ages, and years: counts tend to be lower in middle age (45-64).
It also shows that in the year 1999, there was a bit of a TB outbreak in most age groups, with numbers doubling or tripling compared to other years. The TB Incidence has been increasing among younger age groups in recent years.
How is the sample chart on this step different from the 100% chart you produced?
Continue to develop your skills in developing graphic plots by creating your own bar chart. If you haven’t already, open RStudio on your computer and load the TB data. For this exercise you will also need to load the tidyverse package:
library(tidyverse)
tb <- read_rds("data/tb_tidy.rds")
tb_au <- filter(tb,
country == "Australia",
!is.na(age_group))
Once you’ve installed the data and tidyverse, return to this step and then follow along.
Examine the bar chart. Are the variable mappings the same or different compared to our previous choice, and have the colours changed?
In the bar chart we have the same facets, colours and aesthetics, but what’s changed in the way the bars have been positioned?
Instead of filling them up to 100%, they’re stacked based on their counts.
For your own bar chart, you can achieve this using ggplot2 by changing the position argument in geom_bar() to stack, with the following code chunk:
p <- ggplot(tb_au, aes(x = year, y = count, fill = gender)) +
geom_bar(stat = "identity", position = "stack") +
facet_grid(~ age_group) +
scale_fill_brewer(palette="Dark2")
p
What if you wanted to show the counts in each age category but with respect to gender, as shown in the following chart:
The focus is now on counts by gender, and shows predominantly male incidence of TB. You can also see that incidence among males relative to females is from middle age onwards. There appears to be similar incidence between males and females in younger age groups.
Again, let’s think about what’s changed from your previous charts. The colours, aesthetics and facets have remained the same, it’s just the position of the bars that have been altered
For your own bar chart, you can achieve this using ggplot2 by changing the position argument in geom_bar() to dodge, with the following code chunk:
p <- ggplot(tb_au, aes(x = year, y = count, fill = gender)) +
geom_bar(stat = "identity", position = "dodge") +
facet_grid(~ age_group) +
scale_fill_brewer(palette="Dark2")
p
You also might be interested in looking at how the incidence changes within a gender over time and age groups, as shown in the following chart:
It’s now easier to focus separately on males and females. You can see that the 1999 outbreak of TB mostly affected males. There is also growing incidence in the 25-34 age group is still affecting females which seems to be have stablised for males.
You may have noticed you are facetting on additional variable: gender. You also no longer require a position argument, since there is only one bar per category.
p <- ggplot(tb_au, aes(x = year, y = count, fill = gender)) +
geom_bar(stat = "identity") +
facet_grid(gender ~ age_group) +
scale_fill_brewer(palette="Dark2")
p
Rainbow, rose and pie charts are other types of graphical plot you can use to visualise your data, and with a few changes to your ggplot2 code you can make a drastically different chart.
Consider the following rainbow chart:
The chart is visually appealling, but it’s difficult to interpret. In what way could use ggplot2 to make the chart easier to interpet?
Notice how the aesthetic mappings are different, year is now mapped to the colour of the bar and a single number (not a variable) is mapped to x, that results in a single stacked bar chart. You can change year to be a category rather than a number (by using factor()):
rainbow <- ggplot(tb_au, aes(x = 1, y = count, fill = factor(year))) +
geom_bar(stat = "identity", position="fill") +
facet_grid(gender ~ age_group)
rainbow
You can further manipulate the chart by changing the cartesian coordinates to polar which will then lay everything out in circles, rather than rectangles.
When you use polar coordinates, one axis is mapped to an angle, rather than a position, as shown in the following rose chart:
The year (which was laid out on the x-axis) now corresponds to an angle around a circle going clockwise, rather than a straight line.
This layout makes it easier to interpret how the chart emphasises the middle age groups as having low incidence accross the years.
The code looks the same as the separate bar chart example, with an additional specification coord_polar():
p <- ggplot(tb_au, aes(x = year, y = count, fill = gender)) +
geom_bar(stat = "identity") +
facet_grid(gender ~ age_group) +
scale_fill_brewer(palette="Dark2") +
coord_polar()
p
You can continue to use polar coordinates to modify your rainbow chart and transform it to a pie chart.
The code is the same as before, but this time you want to have the y-axis represent the angle around the circle. You can do this with coord_polar(theta = "y"):
rainbow +
coord_polar(theta = "y")
By doing this, you get another pretty chart that’s difficult to interpret and unuseful for making comparisons across age groups.
Proximity is a a really powerful tool. Knowing how to effectively use facet, and colour to map different variables, will improve your data plots, out of this world!
- Not what we want
- Same information as previous
- We'd like to directly compare male femal for each year, and age, side-by-side bar chart
We want to focus on differences between sexes, and alternatively also differences between ages. To make it easier, we’ll reduce the time period to just one year, 2012.
tb_au %>% filter(year == 2012) %>%
ggplot(aes(x=gender, y=count, fill=gender)) +
geom_bar(stat="identity", position="dodge") +
facet_wrap(~age_group, ncol=6) +
scale_fill_brewer("", palette="Dark2") +
ggtitle("Arrangement A")
tb_au %>% filter(year == 2012) %>%
ggplot(aes(x=age_group, y=count, fill=age_group)) +
geom_bar(stat="identity", position="dodge") +
facet_wrap(~gender, ncol=6) +
scale_fill_brewer("", palette="Dark2") +
ggtitle("Arrangement B")
We’ve got two different rearrangements of the same information.
What do we learn? That is different from each? What’s the focus of each? What’s easy, what’s harder?
- Arrangement A makes it easier to directly compare male and female counts, separately for each age group. Generally, male counts are higher than female counts. There is a big difference between counts in the 45-54 age group, and over 65 counts are almost the same.
- Arrangement B makes it easier to directly compare counts by age group, separately for females and males. For females, incidence drops in the middle years. For males, it is pretty consistently high across age groups.
Extend the last plot to examine two years, 2008 and 2012.
What I love about ggplot2…
Source: https://www.wired.com/wp-content/uploads/2016/01/DB-Transformation-Colour.gif
…you can make so many different displays of the same data, and each time something new may emerge.